ABSTRACT


Three  methods  of  defining  optimality  of  statistical  decision 
rules  are  introduced.  The  first  uses  ideas  of  approximation 
theory  by  defining  the  optimal  decision  as  that  element  of  the 
risk  set  which  best  approximates  an  ideal  rule.  The  second 
optimality  principle  defines  optimality  in  terms  of  minimizing 
functionals.  The  third  method  is  the  axiomatization  of  optimality 
in  statistical  decision  theory. 


MATHEMATICAL  MODELS  FOR  STATISTICAL  DECISION  THEORY 

Bernard  Harris 


1. Introduction. In the typical problem of statistical inference, the
statistician is confronted with the problem of selecting one out of the vast
number of possible decision rules. I will refer to this as the "fundamental
problem of statistics". In this survey paper, I will try to give some
mathematical characterizations of the problem of selecting an optimal decision
procedure. This will not constitute a resolution of the "fundamental problem",
inasmuch as this is intimately tied up with irresolvable philosophical
difficulties. These arise, since in all but a few exceptional problems, there is
no single procedure which can be regarded as uniformly dominating all other
possible procedures. As a consequence, "reasonable people" have disagreed
and will continue to disagree on the specific procedure that should be selected.

Despite  all  of  the  above  difficulties,  a  great  deal  can  still  be  done  to 
formalize  statistical  decision  theory.  Thus,  in  what  follows,  I  will  give  some 
characterizations  of  optimality  in  statistical  inference.  The  discussion  will 
of  necessity  be  brief  and  is  intended  only  to  provide  an  introduction  to  these 
characterizations.  More  extensive  treatments  are  in  preparation  and  will  appear 
subsequently. 

The objective here will be to provide a mathematical structure in
which the details of the original problem are replaced by abstract mathematical
statements which retain those features common to large classes of problems.
This enables us to isolate those aspects of decision theory which are relevant
to the selection of a single decision rule.

For the most part, this paper does not really contain new mathematical
results; instead it is largely concerned with the adaptation of known
mathematical results to statistical problems.

Specifically,  I  will  describe  three  ways  of  providing  a  mathematical 
model  for  statistical  decision  theory.  The  first  is  motivated  by  comparatively 
recent  ideas  of  approximation  theory.  The  second  is  obtained  by  representing 
the  statistical  problem  as  an  optimization  problem.  The  third  approach  is  a 
discussion  of  some  possible  axiomatizations  of  statistical  decision  theory. 

2. Preliminaries. Let $\Theta$ and $\mathcal{A}$ be topological spaces; $\theta$ and $a$ will be
used to denote generic elements of these spaces. Let $L(\theta, a)$ be a mapping
from $\Theta \times \mathcal{A}$ into the reals. We will assume throughout that $L(\theta, a)$ is
uniformly bounded from below; that is, there exists a real number $M > 0$ such
that $L(\theta, a) \ge -M$ for all $\theta \in \Theta$, $a \in \mathcal{A}$. It is customary to refer to $\Theta$ as the
parameter space, $\mathcal{A}$ as the space of actions, and $L$ as the loss function.

We also require a probability space $(\mathcal{X}, \mathcal{B}_{\mathcal{X}}, P_\theta)$, where $\mathcal{X}$ is a space of
elements $x$, $\mathcal{B}_{\mathcal{X}}$ is a $\sigma$-algebra of subsets of $\mathcal{X}$, and $P_\theta$, $\theta \in \Theta$, is one
of a family of probability measures on $\mathcal{B}_{\mathcal{X}}$ indexed by $\Theta$.

An experiment is conducted and the random variable $X$ is observed.
$X$ is assumed to take values in $\mathcal{X}$ and is distributed by $P_\theta$, where $\theta$ is
an element of $\Theta$ whose value is unknown to the statistician.

We can now outline the steps in a statistical problem. The statistician
observes the event $X = x$, $x \in \mathcal{X}$; then, given the family of measures $P_\theta$
but not the value of $\theta$, he selects $a \in \mathcal{A}$ and is assessed the penalty
$L(\theta, a)$. Thus, the objective is to choose $a$ so that $L(\theta, a)$ is "kept small";
ideally, if there exists $a_0$ such that $L(\theta, a_0) = \min_{a \in \mathcal{A}} L(\theta, a)$, then one should
choose $a_0$. However, it is obvious that in order to be able to choose $a_0$,
in general, one would need to know which parameter $\theta \in \Theta$ prevailed. Since
$X$ is distributed by $P_\theta$, the event $X = x$ contains information about $\theta$; hence
the choice of $a \in \mathcal{A}$ should generally depend on the outcome of the experiment.
Thus, given the outcome of the experiment, a mapping $\delta: \mathcal{X} \to \mathcal{A}$ is chosen.
Since $X$ is a random variable, $\delta(X)$ is a random variable, provided that $\delta$
is a measurable mapping. We denote the set of such measurable mappings
by $\mathcal{D}$.

In repetitions of the experiment, $X$ will change; hence, unless
$\delta(X)$ is almost surely constant, $L(\theta, \delta(X))$ will vary. Thus, the average loss,
$E_\theta L(\theta, \delta(X))$, is used as a criterion for choosing $\delta$ rather than the loss in
any given experiment. We call $E_\theta L(\theta, \delta(X))$ the risk function (of $\delta$) and
denote it by $R_\delta(\theta) = R(\theta, \delta)$. Note that our assumptions ($\delta$ measurable and
$L(\theta, a)$ uniformly bounded from below) insure that $R_\delta(\theta)$ "exists" for each
$\theta$; it may, however, be $+\infty$.
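
The definition of the risk function can be illustrated numerically. The following Python sketch assumes a hypothetical finite problem (two parameter values, two possible observations, two actions, and a 0-1 loss); the tables `P` and `LOSS` and all helper names are illustrative assumptions and do not come from the text. It evaluates $R(\theta, \delta) = E_\theta L(\theta, \delta(X))$ for each of the four pure rules.

```python
from itertools import product

# Hypothetical finite problem (an assumption for illustration only).
P = {0: [0.8, 0.2], 1: [0.3, 0.7]}   # P[theta][x] = P_theta(X = x)
LOSS = {(0, 0): 0.0, (0, 1): 1.0,    # LOSS[(theta, a)] = L(theta, a)
        (1, 0): 1.0, (1, 1): 0.0}


def risk(theta, delta):
    """R(theta, delta): expected loss of the pure rule delta, a tuple x -> a."""
    return sum(P[theta][x] * LOSS[(theta, delta[x])] for x in range(len(delta)))


def risk_vector(delta):
    """The risk function of delta, listed as (R(0, delta), R(1, delta))."""
    return tuple(risk(theta, delta) for theta in sorted(P))


# All measurable maps from {0, 1} into {0, 1}: the set of pure rules.
PURE_RULES = list(product([0, 1], repeat=2))
RISKS = {delta: risk_vector(delta) for delta in PURE_RULES}

if __name__ == "__main__":
    for delta, r in RISKS.items():
        print(delta, r)
```

Each pure rule is thereby identified with a point of the plane whose coordinates are its risks at the two parameter values.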

It is desirable to augment $\mathcal{D}$ by introducing the "mixed" or "randomized"
decision procedures $\Phi$, whose elements will be denoted by $\varphi$. To
do this, we introduce a $\sigma$-algebra $\mathcal{B}_{\mathcal{D}}$ of subsets of $\mathcal{D}$ and let $\Phi$ be the set of all probability
distributions on $\mathcal{B}_{\mathcal{D}}$. Of necessity $\mathcal{B}_{\mathcal{D}}$ must include each $\{\delta\}$, $\delta \in \mathcal{D}$,
so that $\mathcal{D} \subset \Phi$, by using distributions such that $P\{\delta\} = 1$. We will refer
to the elements of $\Phi$ as decision rules or decision procedures; when it is
necessary to specifically identify an element of $\Phi$ as being in $\mathcal{D}$, we will
refer to it as a pure decision rule.

Clearly, if $R(\theta, \delta)$ is measurable for each $\theta$,

$$R_\varphi(\theta) = R(\theta, \varphi) = \int_{\mathcal{D}} R(\theta, \delta)\, d\varphi(\delta)$$

is well-defined. We call $R_\varphi(\theta)$, as defined above, the risk function of $\varphi$.
Then, the problem of selecting a decision procedure becomes the problem of
selecting $\varphi \in \Phi$ so that $R(\theta, \varphi)$ is kept "small".

Let $S = \{R_\varphi(\theta),\ \varphi \in \Phi\}$ and let $T$ be the mapping defined by
$T: \Phi \to S$, that is, $T(\varphi) = R_\varphi(\theta)$. We refer to $S$ as the risk set.

If for $\varphi_1, \varphi_2 \in \Phi$, we have $R(\theta, \varphi_1) = R(\theta, \varphi_2)$ for all $\theta \in \Theta$, then $\varphi_1$
and $\varphi_2$ are said to be equivalent. Thus, the elements of $S$ are equivalence
classes of elements of $\Phi$. Consequently, we can replace the problem of
selecting $\varphi \in \Phi$ with the equivalent problem of selecting an element $s \in S$.
Then we can choose any element of $T^{-1}(s)$ as the decision procedure to be
employed.

In order to simplify notation, we will adopt the following conventions
in the material that follows. For $s_1, s_2 \in S$, $s_1 < s_2$ means
$R(\theta, \varphi_1) \le R(\theta, \varphi_2)$ for all $\theta \in \Theta$ and, for some $\theta_0 \in \Theta$,
$R(\theta_0, \varphi_1) < R(\theta_0, \varphi_2)$; here $\varphi_1$ is any element of $T^{-1}(s_1)$ and $\varphi_2$ is any
element of $T^{-1}(s_2)$. Similarly, $s_1 \le s_2$ means $R(\theta, \varphi_1) \le R(\theta, \varphi_2)$ for all
$\theta \in \Theta$.

In particular, we will say that $s_1 \in S$ is inadmissible (and all
$\varphi \in T^{-1}(s_1)$ are inadmissible) if there is an $s_0 \in S$ with $s_0 < s_1$. If there
is no such $s_0$, then $s_1$, equivalently any $\varphi \in T^{-1}(s_1)$, will be said to be
admissible.
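
The ordering and the notion of admissibility can be sketched for a finite collection of risk points. The code below is only an illustration under simplifying assumptions: risk functions are represented as tuples over a finite parameter space, and admissibility is checked relative to the listed points rather than relative to the whole risk set $S$. The function names and the sample points are hypothetical.

```python
def dominates(s1, s2):
    """True when s1 < s2: s1 <= s2 at every theta and s1 < s2 at some theta."""
    pairs = list(zip(s1, s2))
    return all(a <= b for a, b in pairs) and any(a < b for a, b in pairs)


def admissible(points):
    """Points not dominated by any other point in the given finite list."""
    return [s for s in points if not any(dominates(t, s) for t in points)]


# Hypothetical risk points (R(theta_1, .), R(theta_2, .)).
risk_points = [(0.0, 1.0), (0.2, 0.3), (0.8, 0.7), (1.0, 0.0), (0.5, 0.5)]

if __name__ == "__main__":
    print(admissible(risk_points))
```

In this list the points (0.8, 0.7) and (0.5, 0.5) are dominated by (0.2, 0.3), while the remaining three are mutually incomparable.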

Clearly, if $s_1 < s_2$, then $s_2$ should not be employed by the statistician.
Further, if there is an $s_0$ such that $s_0 \le s$ for all $s \in S$, then $s_0$ is to
be chosen and there is no problem of selection. This is, unfortunately, an
exceptional situation. In most problems, an ordering of this type is not
available among the decision rules being considered; it is quite customary to
find oneself with the problem of selecting one of a large set of mutually
incomparable rules.

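Note that $S$ is of necessity a convex set. To see this, observe that
$\Phi$ is a convex set. Then let $\varphi_0 = \lambda \varphi_1 + (1 - \lambda) \varphi_2$, $\varphi_1, \varphi_2 \in \Phi$, $0 \le \lambda \le 1$.
Then

$$R_{\varphi_0}(\theta) = \int_{\mathcal{D}} R(\theta, \delta)\, d[(\lambda \varphi_1 + (1 - \lambda) \varphi_2)(\delta)]$$

$$= \int_{\mathcal{D}} R(\theta, \delta)\, [\lambda\, d\varphi_1(\delta) + (1 - \lambda)\, d\varphi_2(\delta)]$$

$$= \lambda R_{\varphi_1}(\theta) + (1 - \lambda) R_{\varphi_2}(\theta).$$

Since $T$ is a linear mapping, we have that if

$$s_0 = T(\varphi_0),$$

then

$$s_0 = \lambda s_1 + (1 - \lambda) s_2,$$

where

$$s_1 = T(\varphi_1), \quad s_2 = T(\varphi_2).$$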
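
The linearity used in this argument can be checked numerically when $\mathcal{D}$ is finite, so that the integral reduces to a weighted sum. The following sketch assumes three hypothetical pure rules with made-up risk vectors; `risk_of_randomized` and the dictionaries below are illustrative names, not notation from the text.

```python
def risk_of_randomized(phi, pure_risks):
    """R_phi(theta) = sum over delta of phi(delta) * R(theta, delta)."""
    n = len(next(iter(pure_risks.values())))
    return tuple(
        sum(prob * pure_risks[d][i] for d, prob in phi.items())
        for i in range(n)
    )


# Hypothetical risk vectors of three pure rules over two parameter values.
pure_risks = {"d1": (0.0, 1.0), "d2": (0.2, 0.3), "d3": (1.0, 0.0)}

phi1 = {"d1": 0.5, "d2": 0.5}      # a randomized rule
phi2 = {"d2": 0.25, "d3": 0.75}    # another randomized rule
lam = 0.4

# phi0 = lam * phi1 + (1 - lam) * phi2, formed as a mixture of distributions.
phi0 = {d: lam * phi1.get(d, 0.0) + (1 - lam) * phi2.get(d, 0.0)
        for d in pure_risks}

lhs = risk_of_randomized(phi0, pure_risks)
r1 = risk_of_randomized(phi1, pure_risks)
r2 = risk_of_randomized(phi2, pure_risks)
rhs = tuple(lam * a + (1 - lam) * b for a, b in zip(r1, r2))

if __name__ == "__main__":
    print(lhs, rhs)
```

The two tuples agree up to rounding, which is the finite counterpart of $s_0 = \lambda s_1 + (1 - \lambda) s_2$.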


However, as $S$ has been defined, it is not necessarily the convex hull of
$T(\mathcal{D})$, denoted by $\mathrm{co}(T(\mathcal{D}))$. To see this, let $\mathcal{D} = \{-\infty < \delta < \infty\}$,
$\Theta = \{-\infty < \theta < \infty\}$ and

$$R_\delta(\theta) = \begin{cases} 0, & \theta < \delta \\ 1, & \theta \ge \delta. \end{cases}$$

That is, $T(\mathcal{D})$ is the set of all degenerate cumulative distribution functions.
It is easily seen that $\mathrm{co}(T(\mathcal{D}))$ is the set of all cumulative distribution
functions with a finite number of jumps. However, $S$ is the set of all
cumulative distribution functions on $(-\infty, \infty)$.

In the discussion that follows, the "classical" decision criteria
known as the minimax criterion, minimax regret criterion, and Laplace's
criterion will be used repeatedly as illustrations. For the sake of
completeness, they are defined below. Each of these corresponds to a different
interpretation of what might be meant by "keeping $R_\varphi(\theta)$ small".

(1). The minimax criterion. Choose $\varphi_0 \in \Phi$ so that

$$\sup_\theta R_{\varphi_0}(\theta) \le \sup_\theta R_\varphi(\theta)$$

for all $\varphi \in \Phi$.

(2). The minimax regret criterion. Choose $\varphi_0 \in \Phi$ so that

$$\sup_\theta \left[ R_{\varphi_0}(\theta) - \inf_{\psi \in \Phi} R_\psi(\theta) \right] \le \sup_\theta \left[ R_\varphi(\theta) - \inf_{\psi \in \Phi} R_\psi(\theta) \right]$$

for all $\varphi \in \Phi$.

(3). Laplace's criterion. Choose $\varphi_0 \in \Phi$ so that

$$\int_\Theta R_{\varphi_0}(\theta)\, d\mu(\theta) \le \int_\Theta R_\varphi(\theta)\, d\mu(\theta)$$

for all $\varphi \in \Phi$, where $\mu$ is the uniform measure on $\Theta$. (If $\Theta$ is a compact
subset of $E^n$, for example, $\int_\Theta R_\varphi(\theta)\, d\mu(\theta)$ is well-defined. In more general
situations, some modifications to this definition may be required.)

In each of the three cases above $s_0$, the "optimal" element of $S$,
is defined by $s_0 = T(\varphi_0)$.

Each of these reflects a different interpretation of "keeping $R_\varphi(\theta)$
small", in ignorance of the value of $\theta$. The minimax criterion guarantees
that the largest value of $R_\varphi(\theta)$ is as small as possible. The minimax regret
criterion identifies $\inf_{\psi \in \Phi} R_\psi(\theta)$, the lower envelope function, as the smallest
loss that you could incur if you knew $\theta$; hence $R_\varphi(\theta) - \inf_{\psi \in \Phi} R_\psi(\theta)$ is the
additional loss that is incurred by one's ignorance of $\theta$. Then you seek to
make the maximum of this difference as small as possible. In Laplace's
criterion, the philosophy is one of keeping the "average loss" small rather
than the maximum, as is the case with minimax and minimax regret.
Average is here identified with the uniform measure on $\Theta$.
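
To see that the three criteria can disagree, consider the following sketch. It assumes a finite parameter space and searches only over a short, hypothetical list of candidate risk vectors rather than over all of $\Phi$; the uniform measure becomes a plain average. The function names and the numbers are assumptions chosen for illustration.

```python
def minimax(points):
    """Candidate whose largest risk is smallest."""
    return min(points, key=lambda s: max(s))


def minimax_regret(points):
    """Candidate whose largest excess over the lower envelope is smallest."""
    envelope = [min(s[i] for s in points) for i in range(len(points[0]))]
    return min(points, key=lambda s: max(r - v for r, v in zip(s, envelope)))


def laplace(points):
    """Candidate whose average risk (uniform weights) is smallest."""
    return min(points, key=lambda s: sum(s) / len(s))


# Hypothetical risk vectors (R(theta_1, .), R(theta_2, .)).
candidates = [(10.0, 0.0), (6.0, 3.0), (5.0, 5.0), (4.0, 9.0), (7.5, 1.0)]

if __name__ == "__main__":
    print("minimax:       ", minimax(candidates))
    print("minimax regret:", minimax_regret(candidates))
    print("Laplace:       ", laplace(candidates))
```

For these candidates the lower envelope is (4, 0), and each criterion selects a different point, which illustrates why a choice among criteria is itself required.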

Historically, there have been two approaches employed in studying the
question of what might be meant by a "best" $s \in S$. They are

(1) Specify a criterion, then deduce its properties and see if they
are satisfactory.

(2) List the properties that you would like a decision procedure to
possess, then determine the existence and construction of
decision rules meeting the required conditions.

In what follows, the third and fourth sections will follow the first approach
and the fifth section will use the second approach.

For  additional  material  on  the  above  definitions  and  concepts,  the 
reader  is  referred  to  standard  treatises  on  decision  theory,  such  as  D.  Blackwell 
and  M.  A.  Girshick  [  2  ],  T.  S.  Ferguson  [11  ],  and  A.  Wald  [30]. 

3. Approximation theory and statistical decision theory

Let $\mathcal{L}$ be a normed linear space of real-valued functions of $\theta$, $\theta \in \Theta$;
for $x \in \mathcal{L}$, we denote the norm of $x$ by $\|x\|_{\mathcal{L}}$. When there is no danger
of confusion concerning the space under consideration, the subscript $\mathcal{L}$
will be deleted. Let $S_{\mathcal{L}}$ be a convex set in $\mathcal{L}$ and let $v$ be a distinguished
point in $\mathcal{L}$.

We say that $s_0 \in S_{\mathcal{L}}$ is a best approximation to $v$ if $\|s_0 - v\| \le \|s - v\|$
for all $s \in S_{\mathcal{L}}$. The existence and determination of $s_0$ is a well-known
problem in approximation theory and there is a considerable literature about
this topic. A few of the more significant of these results are summarized
below. If we add the additional assumption that $v \le s$ for all $s \in S_{\mathcal{L}}$,
then, we will show that the notion of a best approximation to $v$ is a possible
interpretation of the concept of optimality in statistical decision theory.
Consequently, the theorems of this part of approximation theory frequently
have a natural reinterpretation in a statistical context.

For general discussions of properties of convex sets in normed linear
spaces, the reader is referred to F. A. Valentine [29] and N. Dunford and
J. T. Schwartz, Chapter V, [9]. The theory of best approximations by
elements of convex sets is discussed in the book by I. Singer [26] (in
particular see Appendix I) and in papers by F. R. Deutsch and P. H. Maserick
[8], A. L. Garkavi [12, 13], V. N. Burov [3, 4], and G. S. Rubinstein [23], to
name a few.

We define a hyperplane $H$ in $\mathcal{L}$ as a set of the form

$$H = \{x \in \mathcal{L} : L(x) = c\}$$

where $L \in \mathcal{L}^*$, the adjoint space of $\mathcal{L}$, $L \ne 0$, and $c$ is a real scalar.

Then, the best approximation in $S_{\mathcal{L}}$ to $v$ can be characterized by
the following theorems, which will be stated here without proofs.

Theorem 3.1 (I. Singer [26], F. R. Deutsch and P. H. Maserick [8]). Let
$S_{\mathcal{L}}$ be a convex set in $\mathcal{L}$, a normed linear space, and let $v$ be in the
complement of $S_{\mathcal{L}}$. Then $s_0 \in S_{\mathcal{L}}$ is a best approximation
to $v$ if and only if there exists a linear functional $L \in \mathcal{L}^*$ with

(1) $\|L\| = 1$,

(2) $L(s_0) = \inf_{s \in S_{\mathcal{L}}} L(s)$,

(3) $L(s_0 - v) = \|s_0 - v\|$.

#1160  -9- 


Geometrically, this says that a point $s_0$ in $S_{\mathcal{L}}$ is a best approximation
to $v \notin S_{\mathcal{L}}$ if and only if there is a hyperplane $H$ separating $v$ from $S_{\mathcal{L}}$,
which supports $S_{\mathcal{L}}$ at $s_0$, and whose distance from $v$ is the distance from
$v$ to $s_0$.

Secondly, we have the following characterization of a best approximation.

Theorem 3.2 (A. L. Garkavi [13]). Let $S_{\mathcal{L}}$ be a convex set in the normed
linear space $\mathcal{L}$. Then $s_0 \in S_{\mathcal{L}}$ is a best approximation to $v \notin S_{\mathcal{L}}$ if and
only if for each $s \in S_{\mathcal{L}}$ there is a linear functional $L_s$ in $\mathcal{L}^*$ such that

(1) $L_s$ is an extreme point of the closed unit ball in $\mathcal{L}^*$,

(2) $L_s(s - s_0) \ge 0$,

(3) $L_s(s_0 - v) = \|s_0 - v\|$.

We  now  turn  to  the  connection  between  the  best  approximation  problem 
described  above  and  optimality  in  statistical  decision  theory. 

Let $S$ be the risk set of a statistical decision problem and let $\mathcal{L}$ be
a normed linear space. Let $S_{\mathcal{L}} = S \cap \mathcal{L} = \{s \in S,\ \|s\|_{\mathcal{L}} < \infty\}$. We say that
$s_0 \in S$ is $(v, \mathcal{L})$ optimal if, for $v \le s$ for all $s \in S_{\mathcal{L}}$, $s_0$ is the best
approximation to $v$ from $S_{\mathcal{L}}$. If $S_{\mathcal{L}}$ is empty, then we define every $s \in S$
as $(v, \mathcal{L})$ optimal.

It remains to be shown that this is in fact a reasonable definition of
optimality for statistical decision problems. We will try to justify this
in two steps. First, we will show that minimax, minimax regret, and
Laplace's criterion are $(v, \mathcal{L})$ optimal procedures for particular choices of
$v$ and $\mathcal{L}$. Second, having established that this is in fact a generalization
of these "classical" decision criteria, we will give an intuitive
interpretation of the notion of $(v, \mathcal{L})$ optimality as a family of decision criteria. We
will make the simplifying assumption that $\Theta$ is compact. Modifications to
some definitions will be needed when this is not the case, and can easily
be made. However, these will be omitted here, since they do not serve
the immediate purpose of this exposition.

Let $v = v(\theta) = -M$, $\theta \in \Theta$, and take $\mathcal{L}$ to be the space of
bounded functions $f(\theta)$, $\theta \in \Theta$, with $\|f\|_{\mathcal{L}} = \sup_\theta |f(\theta)|$. Then the best
approximation to $v$ from $S$, that is, the $(v, \mathcal{L})$ optimal decision rule for
this case, is the minimax decision procedure. Note that our hypotheses
insure $v \le s$ for all $s \in S$.

Now employ the same choice of $v$ and let $\mathcal{L}$ be the space of
$\mu$-integrable functions of $\theta$, where $\mu$ is the uniform measure on $\Theta$. Here
the $(v, \mathcal{L})$ optimal decision rule is Laplace's criterion.

To obtain the identification of minimax regret as a $(v, \mathcal{L})$ optimal
decision procedure, we define $v = v(\theta) = \inf_{\varphi \in \Phi} R_\varphi(\theta)$, the lower envelope
function. Clearly $v \le s$ for all $s \in S$. Then the same choice of $\mathcal{L}$ as in
the representation of the minimax criterion provides the representation of the
minimax regret criterion.

Hence, it is evident that this notion of optimality is a generalization
of some of the familiar notions of optimality.
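
These three identifications can be checked on a small example. The sketch below assumes a finite parameter space, a short hypothetical list of candidate risk vectors, and the bound $M = 1$; the sup norm and the uniformly weighted integral norm are written out by hand. All names and numbers are illustrative assumptions.

```python
def sup_norm(f):
    """||f|| = sup over theta of |f(theta)|."""
    return max(abs(x) for x in f)


def l1_norm(f):
    """Integral of |f| with respect to the uniform measure on a finite set."""
    return sum(abs(x) for x in f) / len(f)


def best_approximation(points, v, norm):
    """Element of `points` closest to v in the given norm."""
    return min(points, key=lambda s: norm([a - b for a, b in zip(s, v)]))


# Hypothetical risk vectors; all risks are >= 0 >= -M with M = 1.
candidates = [(10.0, 0.0), (6.0, 3.0), (5.0, 5.0), (4.0, 9.0), (7.5, 1.0)]
M = 1.0
v_const = (-M, -M)
v_envelope = tuple(min(s[i] for s in candidates) for i in range(2))

minimax_choice = best_approximation(candidates, v_const, sup_norm)
laplace_choice = best_approximation(candidates, v_const, l1_norm)
regret_choice = best_approximation(candidates, v_envelope, sup_norm)

if __name__ == "__main__":
    print(minimax_choice, laplace_choice, regret_choice)
```

Because every risk exceeds $-M$, the distance $\|s - v\|$ to the constant $v = -M$ equals the sup (or the average) of $s$ shifted by $M$, so the minimizer coincides with the minimax (or Laplace) choice among the candidates; replacing $v$ by the lower envelope yields the minimax regret choice.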

We now give an intuitive interpretation of $(v, \mathcal{L})$ optimality as a
statistical optimality criterion. The statistician should interpret $v$ as the
"ideal" decision rule, that is, what he would like to be able to accomplish,
such as in the case of "perfect information". The distance from $v$ to $S$
reflects his inability to accomplish this ideal as a consequence of uncertainty.
Since $v$ in general is not attainable, the suggestion is to choose that element
of $S$ which comes as close as possible to the ideal $v$, hence, "a best
approximation to $v$".

To clarify these ideas, consider the following simple example. We
consider a decision theoretic model for the problem of testing a simple hypothesis
against a simple alternative. Thus $\Theta = (\theta_1, \theta_2)$ and $\mathcal{A} = (a_1, a_2)$.
Let

$$L(\theta_i, a_j) = \begin{cases} 0, & i = j \\ 1, & i \ne j. \end{cases}$$

Then the risk set is the set $(\alpha_\varphi, \beta_\varphi)$, corresponding to the probabilities of
errors of the first and second kind using the decision rule $\varphi \in \Phi$. A typical
risk set for such a problem is shown below.

Here it appears natural to set $v = (0, 0)$, which corresponds to a
"perfect" test, that is, one with size zero and power unity. The different
choices of $\mathcal{L}$ correspond to different ways of defining the point in $S$ which
is closest to $(0, 0)$.
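
The dependence of the closest point on the norm can be sketched as follows. The code assumes a hypothetical polygonal lower boundary for the risk set in the $(\alpha, \beta)$ plane, samples it on a grid, and searches for the point nearest $(0, 0)$ under three norms. The vertices, the grid size, and the helper names are assumptions made for illustration, and the search is only as fine as the grid.

```python
import math

# Hypothetical vertices of the lower boundary of a risk set, as (alpha, beta).
VERTICES = [(0.0, 1.0), (0.05, 0.40), (0.25, 0.25), (1.0, 0.0)]


def boundary(vertices, steps=100):
    """Grid of points on the segments joining consecutive vertices."""
    pts = []
    for (a0, b0), (a1, b1) in zip(vertices, vertices[1:]):
        for k in range(steps + 1):
            t = k / steps
            pts.append(((1 - t) * a0 + t * a1, (1 - t) * b0 + t * b1))
    return pts


NORMS = {
    "sup": lambda a, b: max(abs(a), abs(b)),
    "l1": lambda a, b: abs(a) + abs(b),
    "l2": lambda a, b: math.hypot(a, b),
}


def closest_to_origin(points, norm):
    """Grid point of smallest norm, i.e. nearest to v = (0, 0)."""
    return min(points, key=lambda p: norm(*p))


closest = {name: closest_to_origin(boundary(VERTICES), f)
           for name, f in NORMS.items()}

if __name__ == "__main__":
    for name, point in closest.items():
        print(name, point)
```

With these vertices the sup norm selects the test with equal error probabilities, the sum norm selects the vertex minimizing $\alpha + \beta$, and the Euclidean norm selects an interior point of an edge, that is, a randomized test.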

We now return to Theorems 3.1 and 3.2 to reexamine them in the light of
statistical decision theory, rather than approximation theory. First note that
the only significant distinction in the conversion to the statistical problem is
the additional assumption $v \le s$. This readily leads to the following, which
we state as Theorem 3.3.

Theorem 3.3. Let $S_{\mathcal{L}}$ be a convex set in $\mathcal{L}$, a normed linear space, and
let $v \le s$ for all $s \in S_{\mathcal{L}}$. Then, $s_0$ is a best approximation to $v$ if and
only if there exists a positive linear functional $L \in \mathcal{L}^*$ satisfying conditions
(1), (2) and (3) of Theorem 3.1. Similarly, $s_0$ is a best approximation to $v$
if and only if for each $s \in S_{\mathcal{L}}$, there is a positive linear functional $L_s$ in
$\mathcal{L}^*$ satisfying conditions (1), (2) and (3) of Theorem 3.2.

Some of the more immediate consequences, for example, include the
following well-known result in statistical decision theory. If all the risk
functions in $S_{\mathcal{L}}$ are continuous and $S_{\mathcal{L}} \ne \emptyset$, then the space $\mathcal{L}$ determined by
the sup norm coincides with $\mathcal{L}_\infty$ space, say if $\Theta \subset E^n$. The adjoint space
is $\mathcal{L}_1$ space and the positive linear functionals of norm unity are then the
probability distributions on $\Theta$, using the usual Borel $\sigma$-algebra on $\Theta$. Then
the linear functional $L$ of Theorem 3.1 is the least favorable distribution (for
either minimax or minimax regret, depending on the choice of $v$ from the two
alternatives specified earlier). In a less restrictive context, the linear
functional $L$ of Theorem 3.1 gives a prior distribution against which $s_0$ is
Bayes, provided one adds a few minor regularity conditions. Statistical
interpretations of Theorem 3.2 are not as useful, although these can be inferred
as well.

4. Optimization in statistical decision theory defined by minimizing functionals

In this section we introduce another method of defining optimization in
statistical decision problems.

Let $\mathcal{M}$ be a class of extended real valued functionals $h$ on a linear
topological space $\mathcal{U}$ of real valued functions of $\theta$, $\theta \in \Theta$, with the following
properties:

(1) $h(0) = 0$, $|h(-M)| < \infty$, where $-M$ denotes the function
$f(\theta) = -M$, $\theta \in \Theta$.

(2) If $x \le y$, $h(x) \le h(y)$.

Then we say that $s_0 \in S \subset \mathcal{U}$ is h-optimal if $h(s_0) \le h(s)$ for all $s \in S$.
If $h(s) = +\infty$ for all $s \in S$, then every $s \in S$ is said to be h-optimal.

Since every norm is a functional, h-optimality reduces to $(v, \mathcal{L})$
optimality in some particular instances. That is, if $h(s) = \|s - v\|_{\mathcal{L}}$ when
$\|s - v\|_{\mathcal{L}} < \infty$ and $h(s) = \infty$ otherwise, then this is precisely $(v, \mathcal{L})$ optimality.
However, there are many instances of h-optimality which are not expressible
in terms of norms, and hence, this is in fact a generalization of $(v, \mathcal{L})$
optimality.

It is instructive to examine the principle of Bayesian inference in the
light of the definition of h-optimality. Let $\mu$ be a probability measure
(equivalently any measure $\mu$ with $\mu(\Theta) < \infty$) on the Borel sets of $\Theta$. Then
$s_0$ is the Bayes decision rule with respect to $\mu$ if

$$(4.1) \qquad \int_\Theta R_{\varphi_0}(\theta)\, d\mu(\theta) \le \int_\Theta R_\varphi(\theta)\, d\mu(\theta)$$

for all $\varphi \in \Phi$ and $s_0 = T(\varphi_0)$. Note that we need the additional assumption
that $R_\varphi(\theta)$ is measurable with respect to the Borel $\sigma$-algebra on $\Theta$ for every
$\varphi \in \Phi$.

Observe that (4.1) is a statement of minimization with respect to a linear
functional. Since, as a consequence of the Riesz representation theorem,
every positive linear functional has a representation of the form employed in
(4.1) for compact $\Theta$, Bayesian inference coincides in this case with
h-optimality for linear $h$.

This observation provides an obvious explanation for the assertion that
the solution to a Bayesian problem is obtained more easily than the solution
using other criteria. Namely, in the sense employed here, Bayesian problems
are linear problems, that is, minimization with respect to linear functionals.
Other optimization principles are generally non-linear in this sense.

We can relate this principle to a growing body of mathematical literature
as well by noting that this is precisely the structure of mathematical
programming problems. The function $h$ becomes the objective function of the
mathematical programmer and the convex set $S$ is the set of feasible points.
Hence Bayesian inference is linear programming from this point of view.
However, the set of constraints required to generate $S$ need not necessarily
be finite. The general optimization problem is a problem of
non-linear programming in an arbitrary linear space.
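
The linear character of the Bayesian problem can be seen in a finite sketch. Assuming a finite parameter space, a prior given as a vector of weights, and a hypothetical list of pure-rule risk vectors, the Bayes risk is a linear function on the risk set, so its minimum over randomized rules is already attained at a pure rule. The names and numbers below are illustrative assumptions.

```python
def bayes_risk(s, prior):
    """Linear functional of (4.1): sum over theta of prior(theta) * s(theta)."""
    return sum(p * r for p, r in zip(prior, s))


def bayes_rule(pure_risks, prior):
    """Pure-rule risk vector minimizing the Bayes risk."""
    return min(pure_risks, key=lambda s: bayes_risk(s, prior))


def mixtures(pure_risks, steps=20):
    """Risk vectors of two-point randomizations lam*s1 + (1-lam)*s2 on a grid."""
    for s1 in pure_risks:
        for s2 in pure_risks:
            for k in range(steps + 1):
                lam = k / steps
                yield tuple(lam * a + (1 - lam) * b for a, b in zip(s1, s2))


# Hypothetical risk vectors of four pure rules.
PURE = [(0.0, 1.0), (0.2, 0.3), (0.8, 0.7), (1.0, 0.0)]

if __name__ == "__main__":
    for prior in [(0.5, 0.5), (0.1, 0.9)]:
        best = bayes_rule(PURE, prior)
        print(prior, best, bayes_risk(best, prior))
```

No randomization on the grid improves on the best pure rule, which is the familiar fact that a linear objective over a convex polyhedron is minimized at an extreme point.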

Useful work in this area which can be exploited by statisticians includes
that of J. W. Daniel [6, 7], K. Kirchgassner and K. Ritter [15], K. Ritter [22],
and L. W. Neustadt [19]. In particular, it should be noted that some
relationships between mathematical programming and the areas of statistics and
probability are quite well-known. An extensive discussion of these and a
substantial bibliography may be found in the survey paper by O. Krafft [16].




5. Axiomatizations of optimality in statistical decision theory. Another method
of defining optimality in statistical decision problems is to list properties that
you would like such a procedure to possess. I will briefly summarize the
history of this topic and conclude with the statement of some recent results
by E. E. Nordbrock and some of their consequences.

In H. Chernoff [5], a set of eight postulates is exhibited for a finite
decision problem ($\Theta$, $\mathcal{D}$ both finite). For these postulates, Laplace's criterion
is the only rule which satisfies all eight. Chernoff notes that if an additional
postulate were added to the list, a postulate of the "nature duplication" type
to be discussed below, then a contradiction would result. Chernoff's results
"justifying" Laplace's criterion were extended to more general decision problems
by H. Uzawa [27, 28].

In J. Milnor [18], a list of ten postulates for a finite decision problem is given.
Subsets of these which characterize Laplace's criterion, minimax, and minimax
regret are exhibited. It should be noted that minimax regret was proposed by
L. J. Savage in [24, 25], and is referred to as Savage's criterion by Milnor.
Milnor also exhibited a set of eight postulates which are consistent and gave a
construction of a rule which satisfies these postulates. His rule is a precursor
of the rule used by Atkinson, Church, and Harris [1], which will be discussed in
greater detail later. Good [14] proposed a restricted type of minimax rule.

The thirteenth chapter of R. D. Luce and H. Raiffa [17] and the paper of
R. Radner and J. Marschak [21] provide expository treatments of decision
principles.

In Atkinson, Church, and Harris, the following set of postulates was
proposed for the finite decision problem $\Theta = \{\theta_1, \ldots, \theta_n\}$ and $\mathcal{D} = \{d_1, d_2,
\ldots, d_m\}$; here $S$ is a convex polyhedron in $E^n$ and is the convex hull of
the row vectors of the matrix $A = \{a_{ij}\}$, where $a_{ij} = L(d_i, \theta_j)$.

1. The optimal class $Q(A)$ is non-empty.

2. Let $\pi_1$ and $\pi_2$ be permutations acting on $\Theta$ and $\mathcal{D}$ respectively.
Then if $A' = \{a'_{ij}\} = \{a_{\pi_2(i)\,\pi_1(j)}\}$, $Q(A')$ is the set of points of $S$ obtained by
applying $\pi_1$ to the coordinates of points in $Q(A)$.

3. Every element of $Q(A)$ is admissible.

4. $Q(A)$ is a convex subset of $S$.


5. If

$$A_1 = \lambda A_0 + \begin{pmatrix} c_1 & c_2 & \cdots & c_n \\ c_1 & c_2 & \cdots & c_n \\ \vdots & \vdots & & \vdots \\ c_1 & c_2 & \cdots & c_n \end{pmatrix}, \quad \lambda > 0,$$

then $Q(A_1) = \{\lambda x + c,\ x \in Q(A_0)\}$, where $c = (c_1, \ldots, c_n)$.

6. Let $\mathrm{co}(A_1^T) = \mathrm{co}(A_2^T)$, where $A^T$ is the transpose of $A$ and $\mathrm{co}(A)$
is the convex hull of the row vectors of $A$. If $A_1$ is obtained from $A_2$ by
deleting $j$ columns from $A_2$, then $Q(A_1)$ is obtained from $Q(A_2)$ by deleting
the corresponding $j$ coordinates from each element of $Q(A_2)$.

Remark: This type of postulate is usually called a "nature duplication"
postulate.

7. If for two statistical decision problems, $A_1$ and $A_2$ with risk sets
$S_1$ and $S_2$, the points of $S_1$ which are both extreme points and admissible
coincide with the points of $S_2$ which are both extreme points and admissible,
then $Q(A_1) = Q(A_2)$.

8. If $\{A_n\}_{n=1}^{\infty}$ converges to $A_0$ and $x_n \in Q(A_n)$ for every $n$, then
every limit point of $\{x_n\}_{n=1}^{\infty}$ is in $Q(A_0)$. (Here convergence of $\{A_n\}$ is
element by element.)

In [1], it was shown that these postulates are consistent for finite
decision problems and a decision rule called "iterated minimax regret" (IMR)
was shown to satisfy all eight postulates. IMR is closely related to the rule
given by Milnor [18] and is described below.

The iterated minimax regret principle (IMR) selects any element $s_0 \in S$
as optimal which is obtained by the following process.

Let $v_1 = v_1(\theta) = \inf_{\varphi \in \Phi} R_\varphi(\theta)$ and let $\mathcal{L}$ be the normed linear space with,
for $s = T(\varphi)$, $\|s\| = \|R_\varphi(\theta)\| = \sup_{\theta \in \Theta} |R_\varphi(\theta)|$. Let $z_1 = \inf_{s \in S} \|s - v_1\|$ and let $\{\epsilon_n\}_{n=1}^{\infty}$ be a
sequence of positive numbers with $\lim_{n \to \infty} \epsilon_n = 0$. Let $Q_1 = S$ and inductively,
for $n \ge 1$, define

$$Q_{n+1} = \{s \in Q_n : \|s - v_n\| \le z_n + \epsilon_n z_1\},$$

where $v_n = \inf_{\varphi \in T^{-1}(Q_n)} R_\varphi(\theta)$ and $z_n = \inf_{s \in Q_n} \|s - v_n\|$. If $z_1 = \infty$, then all $s \in S$ are said to be optimal.
If $z_1 < \infty$, define $Q = \bigcap_{n=1}^{\infty} Q_n$ and choose as $s_0$ any element of $Q$.

B. Efron [10] extended these results in part to infinite decision problems.
Here some modifications in the postulates must be made and these are listed
below in the form given by E. E. Nordbrock [20].

2'. If $h: \Theta \to \Theta'$ is a homeomorphism and if $S' = \{R'(\theta');\ \theta' \in \Theta';\ R' = R \circ h\}$,
then $Q' = Q \circ h$.

5'. If $S' = \lambda S + c$, where $c = c(\theta)$ is a continuous function of $\theta$,
then $Q' = \lambda Q + c$.

6'. Let $S' = \{R_\varphi(\theta')\}$ and $S = \{R_\varphi(\theta) = R_\varphi|\Theta\}$, where $R|\Theta$ means the
restriction of the domain of $R$ to $\Theta$ and $\Theta \subset \Theta'$. Then if for every $\theta' \in \Theta'$,
there is a probability measure $\mu$ on $\Theta$ such that for all $\varphi \in \Phi$, we have


7'. If $S$ and $S'$ have a common complete class, then $Q = Q'$.

8'. Define

$$d(S, S') = \max \left\{ \sup_{s \in S} \inf_{r \in S'} \|r - s\|,\ \sup_{r \in S'} \inf_{s \in S} \|r - s\| \right\}.$$

Then if $d(S_n, S) \to 0$ as $n \to \infty$ and if $s^{(n)} \in Q_n$ for all $n$, and if
$d(s^{(n)}, s) \to 0$, then $s \in Q$.
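
The distance $d(S, S')$ of postulate 8' is a Hausdorff-type distance between sets. As a minimal sketch, it can be computed directly for finite sets of risk vectors; the choice of the sup norm, the sample sets, and the function names below are assumptions made for illustration, since the text does not fix them.

```python
def sup_dist(r, s):
    """||r - s|| in the sup norm for risk vectors over a finite parameter set."""
    return max(abs(a - b) for a, b in zip(r, s))


def hausdorff(S, S_prime):
    """d(S, S') = max{ sup_s inf_r ||r - s||, sup_r inf_s ||r - s|| }."""
    d1 = max(min(sup_dist(r, s) for r in S_prime) for s in S)
    d2 = max(min(sup_dist(r, s) for s in S) for r in S_prime)
    return max(d1, d2)


# Hypothetical finite sets: S and a sequence S_n converging to it.
S = [(0.0, 1.0), (1.0, 0.0)]


def perturbed(n):
    """S_n: the points of S shifted by 1/n in one coordinate."""
    return [(1.0 / n, 1.0), (1.0, 1.0 / n)]


if __name__ == "__main__":
    for n in (2, 4, 10, 100):
        print(n, hausdorff(S, perturbed(n)))
```

Here $d(S_n, S) = 1/n$, so the sets converge in the sense required by the postulate.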

Efron showed that these postulates (1, 2', 3, 4, 5', 6', 7', 8') are satisfied
by IMR for $S$ a closed bounded convex set in $E^n$. He also claimed that
with the exception of postulate 1 and the weakening of the conclusion of
postulate 8' to $s \in Q$, this holds for closed bounded convex sets in L.
However, the following counterexample shows that inadmissible decision
procedures may result.

Example 5.1 (Nordbrock [20]). Let $\Theta = \{1, 2, \ldots\}$ and let $\mathcal{D} = \{0, 1, 2, \ldots\}$.
For $\delta = 1, 2, \ldots$, let

$R_\delta(\theta) = $
    $\delta = \theta$
    $\delta \ne \theta$

and let

$R_0(\theta) = 1$ for all $\theta \in \Theta$.

Let $S$ be the closed convex hull of $\{R_\delta(\theta)\}$. Then it is easily seen that
$\bigcap_{n=1}^{\infty} Q_n = Q_1 = S$ and $s_0 = R_0(\theta)$ is inadmissible.

We now conclude with a statement of Nordbrock's results.

$S$ is said to be weak intrinsically compact (Wald [30]) if for every
sequence $\{s_n\} \in S$, there is a subsequence $\{s_{n_k}\}$ and an $s' \in S$ such that
$\liminf_{k} R_{n_k}(\theta) \ge R'(\theta)$ for all $\theta$, where $s_{n_k} = R_{n_k}(\theta)$ and $s' = R'(\theta)$.

Theorem 5.1. IMR satisfies properties 2', 4, 5', 6' generally. Property 1
holds if $S$ is weak intrinsically compact. Property 8' holds if $S$ is closed
and properties 3 and 7' hold if $S$ is compact.

6. Summary. In this exposition, I have attempted to give some illustrations
of the possible directions in which the mathematical foundations of statistical
decision theory might be developed. The limitations of this volume preclude
the extensive development of these ideas which is necessary in order to
determine their possible impact on the subject of theoretical statistics. However,
it is hoped that this brief exposition will encourage research workers to further
examine the implications of the ideas contained herein. At this stage it is not
yet apparent whether the material of sections three and four will in fact produce
new basic results in statistics as such. To date, it is possible to identify
many familiar statistical results in the writings of functional analysts and
further display the correspondence between these two areas. The notion of
iterated minimax regret developed in the fifth section is handicapped by its
apparent incomputability, except in rather artificial examples. The principle
has not been successfully applied to any concrete statistical problem as yet.